Support INT8 Weight-Only Quantization #263
Conversation
Force-pushed from 4f99116 to aa960ea (compare)
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@           Coverage Diff           @@
    ##             main     #263   +/-   ##
    =======================================
      Coverage   73.76%   73.76%
    =======================================
      Files         171      171
      Lines       17618    17619    +1
    =======================================
    + Hits        12996    12997    +1
      Misses       4622     4622

☔ View full report in Codecov by Sentry.
Force-pushed from aa960ea to d6f8908 (compare)
Force-pushed from d6f8908 to d989313 (compare)
Walkthrough
Adds a weight-only INT8 quantization option ("int8_wo") across configs, utilities, examples, scripts, and tests; updates the per-layer quantization decision to choose SQ vs. WO based on whether the input quantizer is present and enabled; fixes a documentation typo.
Sequence Diagram(s)

    sequenceDiagram
        autonumber
        actor User
        participant CLI as Script/CLI
        participant HF_PTQ as examples/llm_ptq/hf_ptq
        participant Quant as modelopt/torch/export/quant_utils
        participant Export as modelopt/torch/export
        User->>CLI: invoke with QFORMAT=int8_wo
        CLI->>HF_PTQ: validate args (accept int8_wo)
        HF_PTQ->>Quant: request per-layer quant decision
        Note right of Quant #D3E4CD: Decision for 8-bit weights:\nif input_quantizer present & enabled -> INT8_SQ\nelse -> INT8_WO
        Quant->>Export: to_quantized_weight (INT8_SQ / INT8_WO)
        Export-->>HF_PTQ: return quantized artifacts
        HF_PTQ-->>User: produce HF export artifacts (int8_wo)
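For readers skimming the diagram, here is a minimal runnable sketch of the per-layer decision described in the note above. The function name and stand-in modules are illustrative assumptions, not the actual quant_utils.py implementation.

```python
# Illustrative sketch of the 8-bit per-layer decision described above
# (names are hypothetical; the real logic lives in modelopt/torch/export/quant_utils.py).
from types import SimpleNamespace

QUANTIZATION_INT8_SQ = "int8_sq"
QUANTIZATION_INT8_WO = "int8_wo"


def decide_int8_quantization(module) -> str:
    """Pick SmoothQuant (W8A8) vs. weight-only (W8A16) for a layer with 8-bit weights."""
    input_quantizer = getattr(module, "input_quantizer", None)
    if input_quantizer is not None and input_quantizer.is_enabled:
        # Activations are quantized too -> SmoothQuant.
        return QUANTIZATION_INT8_SQ
    # No (enabled) input quantizer -> weight-only.
    return QUANTIZATION_INT8_WO


# Tiny demo with stand-in modules.
sq_layer = SimpleNamespace(input_quantizer=SimpleNamespace(is_enabled=True))
wo_layer = SimpleNamespace(input_quantizer=SimpleNamespace(is_enabled=False))
print(decide_int8_quantization(sq_layer))  # int8_sq
print(decide_int8_quantization(wo_layer))  # int8_wo
```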
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Force-pushed from d989313 to 12d955e (compare)
Actionable comments posted: 5
🧹 Nitpick comments (2)
modelopt/torch/quantization/config.py (1)
- Lines 634-635: Remember to surface this in user-facing docs/tables. Please add "INT8 Weight-only (W8A16)" to the "Quantization Formats" table to avoid discoverability gaps. I can send a doc patch if you want.
modelopt/torch/export/quant_utils.py (1)
- Lines 92-122: Unused helper (maybe_transpose_expert_weight_dimensions). Defined but not used. Either wire it into the packing paths that need it or drop it to avoid dead code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/guides/_compress_quantized_models.rst (1 hunk)
- examples/llm_ptq/hf_ptq.py (3 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
- modelopt/torch/export/model_config.py (2 hunks)
- modelopt/torch/export/quant_utils.py (5 hunks)
- modelopt/torch/quantization/config.py (2 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunk)
- tests/gpu/torch/export/test_export.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (12)
docs/source/guides/_compress_quantized_models.rst (1)
- Line 35: Typo/clarity fix looks good. Reads well and aligns with the section intent.

modelopt/torch/export/model_config.py (2)
- Line 32: Expose INT8 weight-only enum — OK. Constant addition is consistent with other quant identifiers.
- Line 205: Comment update matches behavior. The concat path now rightly mentions INT8 WO alongside SQ/AWQ/NVFP4.

modelopt/torch/quantization/config.py (1)
- Lines 181-188: Config definition for INT8 weight-only — OK. Per-channel weights-only with axis=0 and inputs disabled; sensible default "max". (A rough sketch of such a config follows this list.)

tests/examples/llm_ptq/test_llm_ptq.py (1)
- Line 98: Test matrix expanded to int8_wo (HF) — good coverage bump.

tests/gpu/torch/export/test_export.py (2)
- Line 60: Import of INT8_WEIGHT_ONLY_CFG — OK.
- Lines 327-328: Block size expectation for INT8 WO = 0 — sane.

examples/llm_ptq/hf_ptq.py (2)
- Line 65: QUANT_CFG_CHOICES: added int8_wo — OK.
- Lines 95-106: HF auto-quant allowlist includes int8_wo — OK. Keeps "int8" out of the auto path, as intended.

modelopt/torch/export/quant_utils.py (3)
- Lines 53-54: New enum import — OK.
- Lines 458-462: INT8 detection split (SQ vs WO) — OK. Uses input_quantizer.is_enabled to distinguish; matches config semantics.
- Lines 633-635: Per-layer config mapping for int8_wo → W8A16 — OK.
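To make the config.py comment above concrete, here is a rough sketch of what a per-channel, weights-only INT8 config could look like, assuming the usual modelopt quant_cfg dictionary layout; the exact contents of the new INT8_WEIGHT_ONLY_CFG may differ.

```python
# Illustrative sketch only -- not the exact INT8_WEIGHT_ONLY_CFG added in this PR.
# Assumes the common modelopt config shape: per-channel (axis=0) 8-bit weights,
# input quantizers disabled (weight-only), and "max" calibration.
INT8_WEIGHT_ONLY_CFG_SKETCH = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},  # per-output-channel weight scales
        "*input_quantizer": {"enable": False},            # no activation quantization
        "default": {"enable": False},
    },
    "algorithm": "max",
}
```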
examples/llm_ptq/hf_ptq.py (outdated)

        "int8",
        "int8_wo",
        "int4_awq",
💡 Verification agent
🧩 Analysis chain
Mismatch with huggingface_example.sh allowlists.
HF path here allows “int8” and “int8_wo”, but the script denies “int8”, causing CLI rejection before Python runs. Update the script allowlists accordingly (see suggested patch in the script file comment).
🏁 Script executed:

    #!/bin/bash
    set -e
    # Locate the huggingface example script
    script_path=$(find . -type f -name huggingface_example.sh | head -n1)
    echo "Found script at: $script_path"
    # Show lines around allowlist/quantization options
    echo "=== Context around allowlisted quant types ==="
    grep -nE 'allowlist|quant' -C3 "$script_path"
    # Show any explicit int8/int8_wo mentions
    echo "=== int8 / int8_wo occurrences ==="
    grep -nE 'int8|int8_wo' -n "$script_path"
    # Show validation logic (case statements) for quant types
    echo "=== Case statements for quant type validation ==="
    grep -nR -n 'case .* in' -C2 "$script_path"

Length of output: 3127
Add “int8” to quantization allowlist in huggingface_example.sh
In examples/llm_ptq/scripts/huggingface_example.sh, update both quant-format case statements (around lines 48 and 77) to include “int8” alongside “int8_wo” and adjust the corresponding error-message lists so that “int8” is accepted.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around the quant-format case
statements near lines ~48 and ~77, the script currently accepts "int8_wo" (and
other formats) but does not include "int8" in the allowed patterns or the
corresponding error-message lists; update both case branches to include "int8"
alongside "int8_wo" (e.g., add "int8" as an accepted pattern in each case
pattern list) and update the error/usage messages so that they list "int8" as an
accepted format where applicable.
      fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
      *)
    -   echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
    +   echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1
HF allowlist should include ‘int8’ to match hf_ptq.py.
Without this, “--export_fmt hf --qformat int8” is blocked by the script. Add “int8” here.
Apply:
- fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+ fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested replacement for the hunk above:

    fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
    *)
      echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
      exit 1
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 48 to 51, the
quantization allowlist is missing the "int8" option which prevents "--export_fmt
hf --qformat int8" from being accepted; add "int8" to the case pattern list (and
update the echo message to include "int8" among the expected values) so the
script accepts that quant argument and exits successfully for valid inputs.
      fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
      *)
    -   echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
    +   echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
        exit 1
🛠️ Refactor suggestion
General quant allowlist also needs ‘int8’ or guard by export_fmt.
Currently this block runs for HF too and rejects “int8”. Minimal fix: add “int8”.
Apply:
- fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+ fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
Alternative (cleaner): wrap this whole validation in if [ "$EXPORT_FORMAT" != "hf" ]; then ... fi so the HF path is validated only once.
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 77-80, the quant
argument validation excludes "int8" and incorrectly rejects it for the HF path;
either add "int8" to the list of allowed quant values in the case pattern or,
preferably, wrap this entire validation block in a guard so it only runs when
EXPORT_FORMAT is not "hf" (i.e., surround the case...esac with if [
"$EXPORT_FORMAT" != "hf" ]; then ... fi), ensuring HF exports are not blocked
and that "int8" remains accepted where appropriate.
    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
        return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
INT8 unpack path also misses 3D (MoE) weights.
Mirror the 3D case to restore full-precision weights correctly.
Apply:
    - if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
    -     return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
    + if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
    +     if weight.dim() == 3:
    +         return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
    +     elif weight.dim() == 2:
    +         return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
    +     else:
    +         raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested replacement for the hunk above:

    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
        if weight.dim() == 3:
            return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
        elif weight.dim() == 2:
            return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
        else:
            raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
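As a sanity check on the scale broadcasting discussed above, the following is a small self-contained round-trip sketch (not the modelopt implementation) of per-channel INT8 weight-only quantize/dequantize for 2D and 3D (MoE-style) weights; it assumes an [num_experts, out_features, in_features] layout for the 3D case.

```python
import torch


def int8_wo_roundtrip(weight: torch.Tensor) -> torch.Tensor:
    """Per-channel INT8 weight-only quantize, then dequantize (illustrative sketch)."""
    # Reduce over the input dim (last dim): 2D -> [out], 3D -> [experts, out].
    amax = weight.abs().amax(dim=-1)
    scale = (amax / 127.0).clamp(min=1e-8)
    # Quantize: broadcast the per-channel scale over the input dim.
    qweight = torch.clamp(torch.round(weight / scale.unsqueeze(-1)), -128, 127).to(torch.int8)
    # Dequantize: same broadcast, mirroring weights_scaling_factor[:, None] / unsqueeze(-1) above.
    return qweight.to(weight.dtype) * scale.unsqueeze(-1).to(weight.dtype)


w2d = torch.randn(16, 32)      # [out_features, in_features]
w3d = torch.randn(4, 16, 32)   # [num_experts, out_features, in_features] (assumed layout)
print(torch.allclose(w2d, int8_wo_roundtrip(w2d), atol=0.03))  # True: round-trip error < scale/2
print(torch.allclose(w3d, int8_wo_roundtrip(w3d), atol=0.03))  # True
```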
Force-pushed from 12d955e to 1b8036e (compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
- Line 98: Explicitly set KV-cache quantization to none for INT8 weight-only HF export. Aligns the test with the documented usage and avoids relying on defaults that might change.

    - PTQCommand(quant="int8_wo", export_fmt="hf"),
    + PTQCommand(quant="int8_wo", export_fmt="hf", kv_cache_quant="none"),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/guides/_compress_quantized_models.rst (1 hunk)
- examples/llm_ptq/hf_ptq.py (3 hunks)
- examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
- modelopt/torch/export/model_config.py (2 hunks)
- modelopt/torch/export/quant_utils.py (5 hunks)
- modelopt/torch/quantization/config.py (2 hunks)
- tests/examples/llm_ptq/test_llm_ptq.py (1 hunk)
- tests/gpu/torch/export/test_export.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/source/guides/_compress_quantized_models.rst
🚧 Files skipped from review as they are similar to previous changes (6)
- modelopt/torch/quantization/config.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/model_config.py
- examples/llm_ptq/hf_ptq.py
- tests/gpu/torch/export/test_export.py
- modelopt/torch/export/quant_utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
- tests/_test_utils/ptq_utils.py (1): PTQCommand (28-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
Thanks @Yuening-wa for adding int8-wo support. 👍
Do we know how the accuracy and perf look compared with int8-sq?
Thanks @Edwardf0t1 for the review. Here is the accuracy comparison against BF16 for the Qwen3-30B-A3B model on MMLU and GSM8K.
Force-pushed from d444ca9 to 4e07a62 (compare)
Overall LGTM. The benefit of int8wo vs int8_sq is that int8wo has TRTLLM torch backend support.
Before approval, could you add this mode into the following tests:
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/tests/examples/llm_ptq/test_llm_ptq.py
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
Signed-off-by: Yuening Li <[email protected]>
Force-pushed from 4e07a62 to 4ea627f (compare)
Thanks for the comments @cjluo-nv. Added the int8wo mode to these two tests.
Signed-off-by: Yuening Li <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
What does this PR do?
Type of change: new feature
Overview: Add support for INT8 weight-only per-channel quantization. The output INT8-quantized checkpoint is in HuggingFace format and can be used directly in the TRTLLM PyTorch workflow.
Usage
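As a rough illustration of the workflow described in the overview, the sketch below assumes the new config is exposed as mtq.INT8_WEIGHT_ONLY_CFG alongside the existing formats and that export_hf_checkpoint is the unified HF export entry point; the model name, export path, and one-shot calibration loop are placeholders, not the PR's documented usage.

```python
# Hypothetical end-to-end sketch (placeholders throughout; not the PR's exact usage).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_name = "facebook/opt-125m"  # placeholder; any HF causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def forward_loop(m):
    # Minimal stand-in calibration; a real run would iterate a calibration dataset.
    inputs = tokenizer("Hello world", return_tensors="pt").to(m.device)
    m(**inputs)


# Weight-only INT8: calibration only needs per-channel weight amax ("max" algorithm).
model = mtq.quantize(model, mtq.INT8_WEIGHT_ONLY_CFG, forward_loop)

# Export a HuggingFace-format checkpoint consumable by the TRTLLM PyTorch backend.
export_hf_checkpoint(model, export_dir="opt-125m-int8wo")
```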
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Examples
Library
Documentation
Tests